This is a first post from a series of articles on Self Organizing Maps (SOM). The next post will cover the topic regarding analyzing the SOM and its quality.

Introduction

The purpose of this short article is to create a Self Organizing Map (SOM) using the Kohonen R package, which could help understand the whole Warsaw Stock Exchange (WSE) market based on financial and economic data. Moreover, in this article I write my simple own function using data.table, purrr and magrittr R packages to efficiently download data from stooq. Note that this analysis closely follows what is already presented here for the US stock market.

Self Organizing Maps

To get to know better or the basics of Self Organizing Maps, I recommend to read as an introduction this Wikipedia article and as an extension the article from the authour of SOM algorithm Teuvo Kohonen.

Building the Self Organizing Map

Getting Data

In the previous blog post I tried various ways to download the financial data using funciton tq_get() from the tidyquant package. This time however I will utilize a slightly different approach. I will write my own function to download the data from the stooq website, which uses data provided by VWD. Unfortunately, the stooq website is available only in polish language. Although the approach to get the data doesn’t require the knowledge of polish. The first step is to specify the tickers for the following financial/economic data: * Warsaw Stock Exchange Index * price/equity ratio for the whole stock market * price/book value ratio for the whole stock market * dividend yield for the whole stock market * unemployment rate * consumer price index y/y * wibor pln 1y - interest rate * 10y bond yield

Please note that further in the R script and the article I will stick to the tickers mentioned in the folloing block of code.

tickers <- c("wig", #Warsaw Stock Exchange Index
             "wig_pe", #price/equity ratio for the whole stock market
             "WIG_PB", #price/book value ratio for the whole stock market
             "wig_dy", #dividend yield for the whole stock market
             "UNRTPL.M", #unemployment rate
             "CPIYPL.M", #consumer price index y/y
             "PLOPLN1Y", #intrest rate- wibor 1y
             "10PLY.B") #10y bond yield

To get data form stooq website I write a simple function read_stooq_data and then I apply it to all tickers using a map function from the purrr package. In this function I utilize the fread() function from the data.table package to download as a data.table object contents of a web url. The frequency of data is set to daily.

read_data_stooq <- function(x){
  fread(paste0("https://stooq.pl/q/d/l/?s=",x,"&i=d"), fill=TRUE) %>% 
    select(Date=Data, !!x:=Zamkniecie)
} #'zamkniecie' means close value

data_all <- purrr::map(tickers, read_data_stooq) 

Note that in the read_data_stooq function I deliberately used double exclamation mark to pass each ticker name, as column name.

Manipulating & Wrangling Data

I have data on all tickers saved as data.tables within a list. It would be the best to merge those different elements of the list into a single data.table. For this purpose I use the following function downloaded from this blog.

multi_join <- function(list_of_loaded_data, join_func, ...){
  output <- Reduce(function(x, y) {join_func(x, y, ...)}, list_of_loaded_data)
}

merged_data <- multi_join(data_all, full_join) 
merged_data_interpolate <- merged_data %>%
  # pivot_longer(cols="Date",values_to ="series" )
  reshape2::melt(id.vars = 'Date', variable.name = 'series')

To make sure that no data is lost in the merge process I utilized the full_join option. Otherwise, as the frequency or the availability of data is different I could lose some important data. To work further with the som_data object I have to use the na.approx() function from the zoo package to fill all NA values.

merged_data_interpolate <- merged_data_interpolate %>% 
  arrange(series, Date) %>% 
  group_by(series) %>% 
  mutate_at(vars(matches("value")), 
            list(value=na.approx), 
            na.rm=FALSE, 
            rule=1:2,
            method="linear")
 
merged_data <- merged_data_interpolate %>% 
  pivot_wider(names_from = series, values_from = value) %>% 
  arrange(Date)

som_data <- merged_data  %>% 
  na.omit()

To make sure that all NA values are properly approximated I use the option rule=1:2. It describes how interpolation is taking place outside the interval [min(x), max(x)]. If rule is 1 then NAs are returned for such points and if it is 2, the value at the closest data extreme is used. Economic data on inflation or unemployment is published only once in a while. In periods in between there were NAs. To fill them I could use either a linear function, a spline or a step-function. I decided to go with a linear function. As a validation setp I apply also the na.omit() function to delete any rows with at least one NA value.

Finally, the resulted data set starts on:

som_data %>% 
  slice(1) %>% 
  pull(Date) %>% 
  as.Date()
## [1] "2007-10-19"

and ends on:

som_data %>% 
  slice(n()) %>% 
  pull(Date) %>% 
  as.Date()
## [1] "2020-01-24"

The final step is to ensure that the data used for the SOM is scaled to account for the corectness of distance.

som_data_no_date <- som_data %>% 
  select(-Date)

som_data_scaled <- apply(som_data_no_date, 2, scale)

Sanity check

As a sanity check I plot all variables in the data set, to investigate whether there are clear outliers.

df <- reshape2::melt(som_data, id.vars = 'Date', variable.name = 'series')
#obviously, one could also utilize the pivot_longer() function
sanity_plots <- ggplot(df, aes(Date,value,group = 1)) + 
  geom_line() + 
  facet_grid(series ~ ., scales = "free")

ggplotly(sanity_plots) %>%
  layout(legend = list(orientation = "h", x = 0.4, y = -0.2))

No apparent errors are visible. Maybe only one a sharp decrease in wig_dy variable around 2009 should be investigated.

Building the map

Using kohonen package is effortless. Firstly, the grid for the Self Organizing Map needs to be specified. A simple 5x5 hexagonal grid should work out. Neighbourhood function specifies how the distance between 2 neurons in any step should be calculated. rlen specifies the number of updates. alpha is the monotically decreasing learning parameter. Additionally, I’m setting a seed to ensure that this analysis is reproducible.

set.seed(1029)
som_grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal", neighbourhood.fct="gaussian")

som_model <- som(som_data_scaled,
                 grid = som_grid,
                 rlen = 300,
                 alpha = c(0.05, 0.01),
                 keep.data = TRUE)

The main point of any SOM analysis is to produce a map. Of course a coloful map is more meaningful than a map in different shades of grey or any other color.

# Set heat colors palette
heatColors <- function(n, alpha = 1) {
  rev(designer.colors(n = n, col = brewer.pal(9, "Spectral")))
}

Also a function for plotting SOM heatmaps for each variable would be helpful.

plot_som <- function(variable) {
  unit_colors <- aggregate(data.frame(som_data[, variable]),
                           by = list(som_model$unit.classif),
                           FUN = mean,
                           simplify = TRUE)
  
  plot(som_model,
       type = "property",
       shape = "straight",
       property = unit_colors[, 2],
       main = variable,
       palette.name = heatColors)
}

Once properly equipped with tools it is possible to plot all variables that are categorized by a SOM.

The next step is to create the type="codes" plot to understand what is the contribution of each variable to each cluster.

plot(som_model,
     shape = "straight",
     type = "codes")

Finally, the plotted clusters are presented. It is assumed that 4 clusters are created.

palette <- c("#F25F73", "#98C94C", "#888E94", "#33A5BF", "#F7D940")
som_cluster <- cutree(hclust(dist(as.data.frame(som_model$codes))), 4)

plot(som_model,
     type = "mapping",
     bgcol = palette[som_cluster],
     main = "Clusters",
     shape = "straight")

Summary

A simple Self Organizing Map helping to help explain the Warsaw Stock Exchange market was created in this article. In the next one it will be assessed in terms of the quality of map, as well as analysis findings.